Query by String word spotting based on character bi-gram indexing
In this paper we propose a segmentation-free query-by-string word spotting method. Both the documents and the query strings are encoded using a recently proposed word representation that projects images and strings into a common attribute space based on a pyramidal histogram of characters (PHOC). These attribute models are learned using linear SVMs over the Fisher Vector representation of the images, along with the PHOC labels of the corresponding strings. In order to search through the whole page, document regions are indexed per character bi-gram using a similar attribute representation. On top of that, we propose an integral image representation of the document, using a simplified version of the attribute model, for efficient computation. Finally, we introduce a re-ranking step to boost retrieval performance. We show state-of-the-art results for segmentation-free query-by-string word spotting on single-writer and multi-writer standard datasets.
Comment: To be published in ICDAR201
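The PHOC descriptor at the heart of this method can be sketched in a few lines. This is a minimal illustration: the alphabet, pyramid levels, and the 50%-overlap assignment rule below are common choices for PHOC-style descriptors, not necessarily the paper's exact configuration.

```python
# Illustrative alphabet; the paper's exact character set is an assumption here.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"

def phoc(word, levels=(1, 2, 3)):
    """Binary pyramidal histogram of characters for a non-empty
    lowercase string: at each pyramid level, the word is split into
    equal regions, and a character is marked present in a region if
    at least half of its normalized position interval falls inside."""
    word = word.lower()
    n = len(word)
    vec = []
    for level in levels:
        for region in range(level):
            lo, hi = region / level, (region + 1) / level
            bits = [0] * len(ALPHABET)
            for i, ch in enumerate(word):
                # normalized occupancy interval of the i-th character
                c_lo, c_hi = i / n, (i + 1) / n
                overlap = min(hi, c_hi) - max(lo, c_lo)
                if ch in ALPHABET and overlap >= (c_hi - c_lo) / 2:
                    bits[ALPHABET.index(ch)] = 1
            vec.extend(bits)
    return vec
```

Because the descriptor depends only on the string, the same vector space can hold both predicted image attributes and string labels, which is what makes query-by-string possible.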
Hierarchical multimodal transformers for Multi-Page DocVQA
Document Visual Question Answering (DocVQA) refers to the task of answering questions from document images. Existing work on DocVQA considers only single-page documents. However, in real scenarios documents are mostly composed of multiple pages that should be processed together. In this work we extend DocVQA to the multi-page scenario. For that, we first create a new dataset, MP-DocVQA, where questions are posed over multi-page documents instead of single pages. Second, we propose a new hierarchical method, Hi-VT5, based on the T5 architecture, that overcomes the limitations of current methods in processing long multi-page documents. The proposed method is based on a hierarchical transformer architecture in which the encoder summarizes the most relevant information of every page and the decoder then takes this summarized information to generate the final answer. Through extensive experimentation, we demonstrate that our method is able, in a single stage, to answer the questions and to provide the page that contains the relevant information to find the answer, which can be used as a kind of explainability measure.
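The hierarchical encode-then-decode flow can be sketched as a toy: mean-pooling stands in for the learned T5 page encoder, and a norm heuristic stands in for the decoder's answer-page prediction. Every function name and heuristic here is an illustrative assumption, not the paper's implementation.

```python
import numpy as np

def encode_page(page_tokens, k=2):
    """Toy stand-in for the page encoder: split the page's token
    vectors into k chunks and mean-pool each one, yielding k
    'summary' tokens (in Hi-VT5 these would be learned embeddings
    attached to special page tokens)."""
    chunks = np.array_split(page_tokens, k)
    return np.stack([c.mean(axis=0) for c in chunks])

def answer_page(pages, k=2):
    """Concatenate every page's summary tokens, as the decoder would
    see them, and pick the answer page as the one whose pooled
    summary carries the most signal (largest norm, in this toy)."""
    summaries = [encode_page(p, k) for p in pages]
    page_strength = [np.linalg.norm(s.mean(axis=0)) for s in summaries]
    return int(np.argmax(page_strength))
```

The key design point survives the simplification: the decoder never sees full pages, only a fixed number of summary tokens per page, so the context length grows slowly with the number of pages.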
Report on the Second Symbol Recognition Contest
http://www.springer.com/lncs
Following the experience of the first edition of the international symbol recognition contest, held during GREC'03 in Barcelona, a second edition was organized during GREC'05. In this paper we first recall the general principles of both contests before presenting the details of this latest edition. In particular, we describe the dataset used in the contest, the methods that took part in it, and an analysis of the results obtained by the participants. We conclude with a synthesis of the contributions and shortcomings of these two editions, and some directions for the organization of a forthcoming contest.
Learning Cross-Modal Deep Embeddings for Multi-Object Image Retrieval using Text and Sketch
In this work we introduce a cross-modal image retrieval system that allows both text and sketch as input modalities for the query. A cross-modal deep network architecture is formulated to jointly model the sketch and text input modalities as well as the image output modality, learning a common embedding between text and images and between sketches and images. In addition, an attention model is used to selectively focus on the different objects of the image, allowing for retrieval with multiple objects in the query. Experiments show that the proposed method performs best in both single- and multiple-object image retrieval on standard datasets.
Comment: Accepted at ICPR 201
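Once such a network has projected queries and images into the common space, retrieval itself reduces to nearest-neighbour search. A minimal sketch, assuming the embeddings are already computed (the function name and toy vectors are illustrative, not the paper's code):

```python
import numpy as np

def retrieve(query_emb, image_embs, top_k=2):
    """Rank database images by cosine similarity to a query embedding.
    Assumes both sides already live in the learned common space --
    the hard part, which the cross-modal network provides."""
    q = query_emb / np.linalg.norm(query_emb)
    db = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    return np.argsort(-(db @ q))[:top_k]
```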
Handwritten Word Spotting with Corrected Attributes
We propose an approach to multi-writer word spotting, where the goal is to find a query word in a dataset comprised of document images. We propose an attributes-based approach that leads to a low-dimensional, fixed-length representation of the word images that is fast to compute and, especially, fast to compare. This approach naturally leads to a unified representation of word images and strings, which seamlessly allows one to indistinctly perform query-by-example, where the query is an image, and query-by-string, where the query is a string. We also propose a calibration scheme, based on Canonical Correlation Analysis, to correct the attribute scores, which greatly improves the results on a challenging dataset. We test our approach on two public datasets, showing state-of-the-art results.
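The calibration idea can be sketched with a regularized least-squares map from raw attribute scores to label space, standing in for the paper's CCA step. `fit_calibration` and `lam` are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def fit_calibration(scores, phocs, lam=1e-3):
    """Ridge-regularized least-squares map from raw attribute scores
    to the binary label (PHOC-style) space -- a simplified stand-in
    for the paper's CCA-based calibration."""
    d = scores.shape[1]
    gram = scores.T @ scores + lam * np.eye(d)
    return np.linalg.solve(gram, scores.T @ phocs)

def calibrated_similarity(query_scores, db_scores, W):
    """Cosine similarity between calibrated query and database
    vectors; strings embed in the same space, so query-by-example
    and query-by-string use the same comparison."""
    q = query_scores @ W
    db = db_scores @ W
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    db = db / np.linalg.norm(db, axis=1, keepdims=True)
    return q @ db.T
```

The point of the calibration, here as in the paper, is to undo systematic distortions in the raw classifier scores before comparing them.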
Comparing Graph Similarity for Graphical Recognition
The original publication is available at www.springerlink.com. 8th International Workshop, GREC 2009, La Rochelle, France, July 22-23, 2009. Selected Papers.
In this paper we evaluate four graph distance measures. The analysis is performed for document retrieval tasks. To this aim, different kinds of documents are used, including line drawings (symbols), ancient documents (ornamental letters), shapes, and trademark logos. The experimental results show that the performance of each graph distance measure depends on the kind of data and the graph representation technique.
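For illustration, here is one very simple graph dissimilarity of the kind such an evaluation might include as a cheap baseline; it is not one of the four measures compared in the paper.

```python
from collections import Counter

def degree_sequence(edges):
    """Sorted degree sequence of an undirected graph given as an
    edge list of (u, v) pairs."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return sorted(deg.values(), reverse=True)

def degree_distance(edges_a, edges_b):
    """L1 distance between zero-padded degree sequences: a crude
    structural dissimilarity that ignores labels and topology beyond
    degrees, which is exactly why more principled graph distances
    (edit distance, spectral measures, etc.) are worth comparing."""
    a, b = degree_sequence(edges_a), degree_sequence(edges_b)
    n = max(len(a), len(b))
    a += [0] * (n - len(a))
    b += [0] * (n - len(b))
    return sum(abs(x - y) for x, y in zip(a, b))
```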